Metabolic Process




Combining LLMs and Knowledge Graphs to Reduce Hallucinations in Question Answering

Pusch, Larissa, Conrad, Tim O. F.

arXiv.org Artificial Intelligence

Advancements in natural language processing have revolutionized the way we can interact with digital information systems, such as databases, making them more accessible. However, challenges persist, especially when accuracy is critical, as in the biomedical domain. A key issue is the hallucination problem, where models generate information unsupported by the underlying data, potentially leading to dangerous misinformation. This paper presents a novel approach designed to bridge this gap by combining Large Language Models (LLMs) and Knowledge Graphs (KGs) to improve the accuracy and reliability of question-answering systems, demonstrated on a biomedical KG. Built on the LangChain framework, our method incorporates a query checker that ensures the syntactic and semantic validity of LLM-generated queries, which are then used to extract information from a Knowledge Graph, substantially reducing errors such as hallucinations. We evaluated the overall performance using a new benchmark dataset of 50 biomedical questions, testing several LLMs, including GPT-4 Turbo and llama3:70b. Our results indicate that while GPT-4 Turbo outperforms other models in generating accurate queries, open-source models like llama3:70b show promise with appropriate prompt engineering. To make this approach accessible, a user-friendly web-based interface has been developed, allowing users to input natural language queries, view generated and corrected Cypher queries, and verify the resulting paths for accuracy. Overall, this hybrid approach effectively addresses common issues such as data gaps and hallucinations, offering a reliable and intuitive solution for question-answering systems. The source code for generating the results of this paper and for the user interface can be found in our Git repository: https://git.zib.de/lpusch/cyphergenkg-gui
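The query-checking step described above can be illustrated with a minimal sketch (the schema, function names, and regexes here are hypothetical illustrations, not the authors' implementation): before an LLM-generated Cypher query is executed against the KG, it is screened for references to node labels or relationship types that do not exist in the graph schema.

```python
import re

# Hypothetical KG schema; a real system would read this from the graph itself.
KNOWN_LABELS = {"Gene", "Protein", "Disease", "Pathway"}
KNOWN_RELATIONSHIPS = {"ENCODES", "ASSOCIATED_WITH", "PARTICIPATES_IN"}

def check_cypher_query(query: str) -> list[str]:
    """Return a list of schema violations found in a generated Cypher query."""
    errors = []
    # Node labels appear in patterns like (var:Label).
    for label in re.findall(r"\(\s*\w*\s*:\s*(\w+)", query):
        if label not in KNOWN_LABELS:
            errors.append(f"unknown node label: {label}")
    # Relationship types appear in patterns like [var:TYPE].
    for rel in re.findall(r"\[\s*\w*\s*:\s*(\w+)", query):
        if rel not in KNOWN_RELATIONSHIPS:
            errors.append(f"unknown relationship type: {rel}")
    return errors

good = "MATCH (g:Gene)-[:ENCODES]->(p:Protein) RETURN p.name"
bad = "MATCH (g:Gen)-[:CODES_FOR]->(p:Protein) RETURN p.name"
```

A production checker would pull labels and relationship types from the live graph (e.g. via Neo4j's `CALL db.labels()`) and validate full syntax as well, but the reject-before-execution pattern is the same.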


Geneverse: A collection of Open-source Multimodal Large Language Models for Genomic and Proteomic Research

Liu, Tianyu, Xiao, Yijia, Luo, Xiao, Xu, Hua, Zheng, W. Jim, Zhao, Hongyu

arXiv.org Artificial Intelligence

The applications of large language models (LLMs) are promising for biomedical and healthcare research. Despite the availability of open-source LLMs trained using a wide range of biomedical data, current research on the applications of LLMs to genomics and proteomics is still limited. To fill this gap, we propose a collection of finetuned LLMs and multimodal LLMs (MLLMs), known as Geneverse, for three novel tasks in genomic and proteomic research. The models in Geneverse are trained and evaluated based on domain-specific datasets, and we use advanced parameter-efficient finetuning techniques to adapt the models to tasks including the generation of descriptions for gene functions, protein function inference from structure, and marker gene selection from spatial transcriptomic data. We demonstrate that adapted LLMs and MLLMs perform well for these tasks and may outperform closed-source large-scale models based on our evaluations focusing on both truthfulness and structural correctness. All of the training strategies and base models we used are freely accessible.
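The parameter-efficient finetuning mentioned above can be sketched in miniature; the LoRA-style low-rank update below is one common such technique (whether Geneverse uses exactly this variant is not stated here, and all matrices and dimensions are toy values): instead of updating a full weight matrix W, two small factors A and B are trained, and W + A·B is used at inference.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add(a, b):
    """Elementwise sum of two same-shape matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

d, r = 4, 1  # model dimension and low rank; real models use d in the thousands
W = [[0.0] * d for _ in range(d)]   # frozen pretrained weight (toy values)
A = [[1.0] for _ in range(d)]       # trainable factor, d x r
B = [[0.5] * d]                     # trainable factor, r x d

W_eff = add(W, matmul(A, B))        # effective weight used at inference

full_params = d * d                 # what full finetuning would update
lora_params = d * r + r * d         # what is actually trained
```

The appeal is the parameter count: for large d and small r, training 2·d·r values instead of d² makes adaptation feasible on modest hardware.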


ProtT3: Protein-to-Text Generation for Text-based Protein Understanding

Liu, Zhiyuan, Zhang, An, Fei, Hao, Zhang, Enzhi, Wang, Xiang, Kawaguchi, Kenji, Chua, Tat-Seng

arXiv.org Artificial Intelligence

Language Models (LMs) excel in understanding textual descriptions of proteins, as evident in biomedical question-answering tasks. However, their capability falters with raw protein data, such as amino acid sequences, due to a deficit in pretraining on such data. Conversely, Protein Language Models (PLMs) can understand and convert protein data into high-quality representations, but struggle to process texts. To address their limitations, we introduce ProtT3, a framework for Protein-to-Text Generation for Text-based Protein Understanding. ProtT3 empowers an LM to understand protein sequences of amino acids by incorporating a PLM as its protein understanding module, enabling effective protein-to-text generation. This collaboration between PLM and LM is facilitated by a cross-modal projector (i.e., Q-Former) that bridges the modality gap between the PLM's representation space and the LM's input space. Unlike previous studies focusing on protein property prediction and protein-text retrieval, we delve into the largely unexplored field of protein-to-text generation. To facilitate comprehensive benchmarks and promote future research, we establish quantitative evaluations for protein-text modeling tasks, including protein captioning, protein question-answering, and protein-text retrieval. Our experiments show that ProtT3 substantially surpasses current baselines, with ablation studies further highlighting the efficacy of its core components. Our code is available at https://github.com/acharkq/ProtT3.
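The role of the cross-modal projector can be illustrated with a toy single-head attention in plain Python (a real Q-Former is a trained multi-layer transformer; all vectors and dimensions below are made up): a fixed set of learnable query vectors pools a variable-length sequence of protein embeddings into a fixed number of vectors the LM can consume.

```python
import math

def attend(queries, keys_values):
    """Dot-product attention: each query vector pools the key/value
    vectors into one output vector of the same dimension."""
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys_values]
        m = max(scores)                       # subtract max for stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]       # softmax over the sequence
        outputs.append([sum(w * k[j] for w, k in zip(weights, keys_values))
                        for j in range(len(q))])
    return outputs

# A PLM might emit one embedding per residue; the projector compresses
# them to a fixed set of "query tokens" regardless of sequence length.
protein_embeddings = [[0.1, 0.2], [0.3, 0.1], [0.0, 0.5]]  # 3 residues, dim 2
learnable_queries = [[1.0, 0.0], [0.0, 1.0]]               # 2 query tokens

lm_inputs = attend(learnable_queries, protein_embeddings)
```

However long the protein is, the LM always receives the same number of projected vectors, which is what bridges the PLM's representation space and the LM's input space.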


Temporal Causal Mediation through a Point Process: Direct and Indirect Effects of Healthcare Interventions

Hızlı, Çağlar, John, ST, Juuti, Anne, Saarinen, Tuure, Pietiläinen, Kirsi, Marttinen, Pekka

arXiv.org Artificial Intelligence

Deciding on an appropriate intervention requires a causal model of a treatment, the outcome, and potential mediators. Causal mediation analysis lets us distinguish between direct and indirect effects of the intervention, but has mostly been studied in a static setting. In healthcare, data come in the form of complex, irregularly sampled time-series, with dynamic interdependencies between a treatment, outcomes, and mediators across time. Existing approaches to dynamic causal mediation analysis are limited to regular measurement intervals and simple parametric models, and disregard long-range mediator-outcome interactions. To address these limitations, we propose a non-parametric mediator-outcome model where the mediator is assumed to be a temporal point process that interacts with the outcome process. With this model, we estimate the direct and indirect effects of an external intervention on the outcome, showing how each of these affects the whole future trajectory. We demonstrate on semi-synthetic data that our method can accurately estimate direct and indirect effects. On real-world healthcare data, our model infers clinically meaningful direct and indirect effect trajectories for blood glucose after a surgery.
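The mediator-as-point-process idea can be illustrated with a toy conditional intensity (an exponentially decaying kernel chosen for simplicity, not the authors' non-parametric model): each past mediator event bumps the outcome intensity, and the bump decays over time.

```python
import math

def outcome_intensity(t, mediator_events, base_rate=0.5,
                      jump=1.0, decay=2.0):
    """Conditional intensity of the outcome process at time t:
    a baseline plus an exponentially decaying bump per past mediator event."""
    return base_rate + sum(jump * math.exp(-decay * (t - s))
                           for s in mediator_events if s < t)

# Toy mediator event times (e.g. meals affecting blood glucose).
events = [1.0, 2.5]
lam_before = outcome_intensity(0.9, events)  # no event yet: baseline only
lam_after = outcome_intensity(1.1, events)   # just after the first event
```

In this framing, an intervention that shifts the mediator's event times changes the outcome intensity along the whole future trajectory, which is what separating direct from indirect effects has to account for.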


Latest Machine Learning Research Uncovers a Hidden Order in Scents

#artificialintelligence

Alex Wiltschko is an olfactory neuroscientist on Google Research's Brain Team. He recently employed machine learning to analyze smell, our oldest and least understood sense. His findings considerably increased scientists' capacity to determine a molecule's scent from its structure. Over 800 chemicals reach your nose when you smell coffee, and our brains create the general impression of coffee from this chemical image.


AI Model Links Smell Molecules With Metabolic Processes

#artificialintelligence

Alex Wiltschko began collecting perfumes as a teenager. His first bottle was Azzaro Pour Homme, a timeless cologne he spotted on the shelf at a T.J. Maxx department store. He recognized the name from Perfumes: The Guide, a book whose poetic descriptions of aroma had kick-started his obsession. Enchanted, he saved up his allowance to add to his collection. "I ended up going absolutely down the rabbit hole," he said.


Inferring Microbial Biomass Yield and Cell Weight using Probabilistic Macrochemical Modeling

Paiva, Antonio R., Pilloni, Giovanni

arXiv.org Machine Learning

Growth rates and biomass yields are key descriptors used in microbiology studies to understand how microbial species respond to changes in the environment. Of these, biomass yield estimates are typically obtained from cell counts and measurements of the feed substrate. However, these quantities are perturbed by measurement noise. Perhaps most crucially, estimating biomass from cell counts, as needed to assess yields, relies on an assumed cell weight. Noise and discrepancies in these assumptions can lead to significant changes in conclusions regarding a microbe's response. This article proposes a methodology to address these challenges using probabilistic macrochemical models of microbial growth. It is shown that such a model can make full use of the experimental data, greatly relax the assumptions on the cell weight, and provide uncertainty estimates of key parameters. These capabilities are demonstrated and validated herein using several case studies with synthetically generated microbial growth data.
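The central move, treating cell weight as uncertain rather than fixed, can be sketched with a toy grid posterior (the priors, noise model, and all numbers below are illustrative assumptions, not the paper's macrochemical model): noisy cell counts and an independent biomass measurement are combined into a posterior over cell weight, which is then propagated into an uncertainty estimate for the yield.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a Normal(mu, sigma) distribution at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Toy observations (all values illustrative):
substrate_consumed = 2.0   # g of substrate used during growth
count_obs = 4.0e9          # observed cell count
biomass_obs = 2.4e-3       # g, independent dry-weight measurement
biomass_sigma = 0.5e-3     # noise on the dry-weight measurement

# Rather than assuming one fixed cell weight, keep a prior over candidates.
weights_pg = [w / 10 for w in range(2, 11)]           # 0.2 .. 1.0 pg/cell
prior = [normal_pdf(w, 0.5, 0.2) for w in weights_pg]

# Posterior over cell weight: prior times likelihood of the biomass data.
post_unnorm = [p * normal_pdf(biomass_obs, count_obs * w * 1e-12, biomass_sigma)
               for w, p in zip(weights_pg, prior)]
z = sum(post_unnorm)
posterior = [u / z for u in post_unnorm]

# Each candidate weight implies a yield; propagate the uncertainty.
yields = [count_obs * w * 1e-12 / substrate_consumed for w in weights_pg]
yield_mean = sum(y * p for y, p in zip(yields, posterior))
yield_sd = math.sqrt(sum((y - yield_mean) ** 2 * p
                         for y, p in zip(yields, posterior)))
```

Reporting `yield_mean` with `yield_sd` makes explicit how much of the conclusion rests on the assumed cell weight, instead of hiding that assumption inside a point estimate.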


Understanding metabolic processes through machine learning

#artificialintelligence

Bioinformatics researchers at Heinrich Heine University Düsseldorf (HHU) and the University of California San Diego (UCSD) are using machine learning techniques to better understand enzyme kinetics and thus also complex metabolic processes. The team, led by first author Dr. David Heckmann, describes its results in the current issue of the journal Nature Communications. The synthetic life sciences rely on a detailed, quantitative understanding of the complex systems in biological cells; only when such systems are understood can they be manipulated in a targeted way. One already well-known system is biological metabolism, in which many hundreds of enzymes are involved.